Module 02 - Python Observability

What You Will Learn

By the end of this module you will be able to:

Explain the three pillars of observability and why each one is irreplaceable
Configure production-grade structured logging with structlog and correlation IDs
Expose Prometheus metrics from a FastAPI service and write real PromQL queries
Instrument a Python microservice with OpenTelemetry and visualise traces in Jaeger
Capture, group, and alert on production exceptions with Sentry
Build health check endpoints that Kubernetes actually trusts

Prerequisites

Requirement	Why It Matters
Python 3.11+	`contextvars`, `asyncio`, type hints used throughout
FastAPI basics	All production examples use FastAPI
Docker + docker-compose	Every tool in this module runs locally via compose
Module 1 complete	Async patterns and profiling context assumed
Basic SQL / PostgreSQL	Incident examples reference `pg_stat_activity`

The Incident That Starts Every Observability Story

It is 14:23 on a Tuesday. Requests per second on your Python API are normal. HTTP 200s are flowing. No exceptions in Sentry. No alerts in PagerDuty. But your product manager has just forwarded a screenshot from a paying customer: every action in the app takes 8–12 seconds instead of the usual 400ms.

You open your logs:

INFO:uvicorn.access: 200 POST /api/documents 11432ms
INFO:uvicorn.access: 200 POST /api/documents 9871ms
INFO:uvicorn.access: 200 POST /api/documents 12103ms

The service is returning 200 OK. Latency is terrible. Logs show nothing useful - no query, no user, no context. You have no metrics so you cannot see when it started or whether it is getting worse. You have no traces so you cannot see where the 11 seconds are actually going.

Four hours later, after crawling through application code and guessing, someone runs this on the database host:

SELECT count(*), state, wait_event_type, wait_event
FROM pg_stat_activity
WHERE datname = 'myapp_prod'
GROUP BY state, wait_event_type, wait_event
ORDER BY count DESC;

 count | state  | wait_event_type | wait_event
-------+--------+-----------------+------------
    48 | active | Lock            | relation
     2 | idle   |                 |
     0 | ...

Connection pool exhaustion. The application pool was set to 10 connections, and the query output shows database sessions stuck in lock waits. Under load, requests queued for a free connection, each waiting up to the 30-second timeout. The service never returned an error because the requests eventually succeeded - just 11 seconds late.

Four hours of debugging for a problem that a single Prometheus gauge would have surfaced in four seconds.
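
As a rough illustration of the numbers such a gauge would carry, the sketch below caps a pool with an asyncio semaphore and tracks connections in use and requests waiting. It uses only the standard library. `PoolStats`, `make_pool`, `simulate`, and the metric names are hypothetical, and a real service would export these values through a Prometheus client library rather than formatting them by hand.

```python
import asyncio
from contextlib import asynccontextmanager
from dataclasses import dataclass


@dataclass
class PoolStats:
    """Saturation numbers a gauge would export (hypothetical names)."""
    size: int
    in_use: int = 0
    waiting: int = 0

    def render(self) -> str:
        # Hand-written lines in the style of the Prometheus text format.
        return (
            f"db_pool_size {self.size}\n"
            f"db_pool_in_use {self.in_use}\n"
            f"db_pool_waiting {self.waiting}\n"
        )


def make_pool(size: int):
    stats = PoolStats(size=size)
    slots = asyncio.Semaphore(size)

    @asynccontextmanager
    async def acquire_connection():
        stats.waiting += 1              # request is queued for a slot
        try:
            await slots.acquire()
        finally:
            stats.waiting -= 1
        stats.in_use += 1
        try:
            yield
        finally:
            stats.in_use -= 1
            slots.release()

    return stats, acquire_connection


async def simulate(pool_size: int, requests: int) -> PoolStats:
    stats, acquire_connection = make_pool(pool_size)
    snapshot = PoolStats(size=pool_size)

    async def handle_request():
        async with acquire_connection():
            await asyncio.sleep(0.2)    # stand-in for a slow query

    async def sample():
        await asyncio.sleep(0.02)       # read the stats while requests are in flight
        snapshot.in_use, snapshot.waiting = stats.in_use, stats.waiting

    await asyncio.gather(sample(), *(handle_request() for _ in range(requests)))
    return snapshot


if __name__ == "__main__":
    print(asyncio.run(simulate(pool_size=10, requests=25)).render())
```

With 25 concurrent requests against a pool of 10, the snapshot shows the pool full and 15 requests queued. An alert on the waiting count, or on in-use divided by size, would point at this class of incident directly.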

This module is about never having that four-hour incident again.

Why Observability is Not Just Logging

Most engineers learn to "add logging" and consider observability done. That mental model breaks in production. Here is why the three pillars are each irreplaceable:

┌─────────────────────────────────────────────────────────────────┐
│                    OBSERVABILITY STACK                          │
├─────────────────┬──────────────────┬────────────────────────────┤
│     LOGS        │     METRICS      │        TRACES              │
│                 │                  │                            │
│  "What happened │  "How much /     │  "Where did the time go?"  │
│   and when?"    │   how often?"    │                            │
│                 │                  │                            │
│  Discrete       │  Aggregated      │  Causal chain across       │
│  events with    │  numerical       │  services, showing         │
│  context        │  measurements    │  parent-child timing       │
│                 │  over time       │                            │
│  structlog      │  Prometheus      │  OpenTelemetry + Jaeger    │
│  Loki           │  Grafana         │                            │
│  Datadog Logs   │  Alertmanager    │                            │
├─────────────────┴──────────────────┴────────────────────────────┤
│                   ERROR TRACKING                                │
│         "Which exceptions are happening, how often,            │
│          and what is their full context?"                       │
│                        Sentry / GlitchTip                       │
├─────────────────────────────────────────────────────────────────┤
│                   HEALTH CHECKS                                 │
│         "Is this service safe to receive traffic right now?"    │
│                  /liveness  /readiness  /startup                │
└─────────────────────────────────────────────────────────────────┘

Logs: What Happened

A log is a discrete event record. It has a timestamp, a severity level, a message, and ideally a rich set of structured key-value context fields. Logs answer questions like:

"Which user triggered this error?"
"What SQL query ran before the exception?"
"What was the document_id being processed when the worker died?"

Logs are high cardinality - you can store as much context per event as you need. They are bad at answering "how often does this happen?" because that requires reading and counting many log lines.

Metrics: How Much / How Often

A metric is a numerical measurement aggregated over time. It answers:

"How many requests per second are we serving right now?"
"What is the p99 latency of the /api/classify endpoint?"
"How many database connections are currently in use?"

Metrics must stay low cardinality - every distinct combination of label values becomes its own time series, so a per-user Prometheus label multiplies the series count by the number of users. Within that limit they are great for alerting because they are pre-aggregated and cheap to query.
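
The cardinality arithmetic can be made concrete with a few lines of Python. This is a back-of-the-envelope sketch: the label names and counts are assumptions chosen for illustration, and the product is a worst case that assumes every combination actually occurs.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Worst-case number of time series for one metric: the product of
    the number of distinct values each label can take."""
    return prod(label_values.values())


# Bounded labels: a handful of endpoints, methods, and status classes.
bounded = {"endpoint": 20, "method": 4, "status_class": 5}

# The same metric with a per-user label added (user count is an assumption).
with_user_id = {**bounded, "user_id": 50_000}

if __name__ == "__main__":
    print(series_count(bounded))        # 400
    print(series_count(with_user_id))   # 20000000
```

Per-user detail belongs in logs or span attributes, where high cardinality costs nothing extra per distinct value.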

Traces: Where Did the Time Go

A trace is a causal chain of timed operations across a distributed system. A single user request might touch an API gateway, two microservices, a database, Redis, and an external LLM API. A trace shows you:

The exact wall-clock time each service spent on the request
Which service was the bottleneck
The gaps between services (network, queues, serialisation)
Whether a slow downstream dependency caused a cascade

Traces answer the question that logs and metrics cannot: "The request took 800ms total - where did that time go?"
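
To make the parent-child timing idea tangible, here is a toy in-process tracer built on a context manager. It is a standard-library sketch, not the OpenTelemetry API: `Tracer`, `Span`, `slowest_child`, and the span names are hypothetical, and the sleeps stand in for real work.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    name: str
    parent: "Span | None" = None
    duration_ms: float = 0.0
    children: list["Span"] = field(default_factory=list)


class Tracer:
    """Toy tracer: records nested, timed spans for a single request."""

    def __init__(self) -> None:
        self.root: Span | None = None
        self._current: Span | None = None

    @contextmanager
    def span(self, name: str):
        span = Span(name=name, parent=self._current)
        if self._current is None:
            self.root = span
        else:
            self._current.children.append(span)
        self._current = span
        start = time.perf_counter()
        try:
            yield span
        finally:
            span.duration_ms = (time.perf_counter() - start) * 1000
            self._current = span.parent   # pop back to the enclosing span


def slowest_child(span: Span) -> Span:
    """The direct child that consumed the most time under `span`."""
    return max(span.children, key=lambda s: s.duration_ms)


def handle_request(tracer: Tracer) -> None:
    with tracer.span("POST /api/documents"):
        with tracer.span("redis.get"):
            time.sleep(0.005)
        with tracer.span("postgres.query"):
            time.sleep(0.05)              # the bottleneck in this sketch
        with tracer.span("serialize.response"):
            time.sleep(0.005)


if __name__ == "__main__":
    tracer = Tracer()
    handle_request(tracer)
    for child in tracer.root.children:
        print(f"{child.name:<20} {child.duration_ms:7.1f} ms")
    print("bottleneck:", slowest_child(tracer.root).name)
```

A distributed tracer does the same bookkeeping, but it also carries the trace and span IDs across process boundaries so that spans from different services join into one tree.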

The Mistake: Thinking One Pillar Is Enough

Scenario	Logs alone	Metrics alone	Traces alone
High p99 latency	See individual slow requests but no pattern	See the spike but not why or which service	See the bottleneck if you have a trace
Exception spike	See exceptions with context	See the rate spike but no context	Traces show span errors but not exception details
Connection pool exhaustion	See timeout errors but not pool state	See the gauge and alert immediately	Not directly visible without custom spans
Which user was affected	Yes, if logged	No - metrics are aggregated	Yes, if user ID in span attributes

You need all three. They are complementary, not redundant.

The print() Problem

Every Python developer starts with print(). Here is what is wrong with it in production:

# What most beginners write
print(f"Processing document {doc_id}")
print(f"Error: {e}")

# What production requires
import structlog
log = structlog.get_logger()

log.info(
    "document.processing.started",
    document_id=doc_id,
    user_id=current_user.id,
    file_size_bytes=doc.size,
    content_type=doc.content_type,
)

The difference is not cosmetic. With print():

There is no timestamp (or it is not machine-parseable)
There is no severity level - you cannot filter for errors only
There is no structured data - you cannot query document_id = "abc123" in Kibana
It writes synchronously to stdout - if whatever consumes that stream is slow, the write can stall your event loop
You cannot route it to different destinations (file, syslog, log aggregator)
You cannot suppress it in tests without redirecting stdout

With structured logging, a log line becomes a queryable document:

{
  "timestamp": "2026-03-07T14:23:01.234Z",
  "level": "info",
  "event": "document.processing.started",
  "document_id": "doc_8f3a2c",
  "user_id": "usr_99f1b4",
  "file_size_bytes": 204800,
  "content_type": "application/pdf",
  "service": "document-api",
  "version": "2.14.0",
  "environment": "production",
  "request_id": "req_7e9d3b"
}

That single line can be searched, aggregated, alerted on, and correlated with traces - automatically.
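
For readers who want to see the mechanics without installing anything, the sketch below produces a similar JSON line with only the standard library. It is a stand-in for what structlog does, not structlog's API: `JsonFormatter`, `build_logger`, `request_id_var`, and the `extra={"fields": ...}` convention are assumptions made for this example.

```python
import contextvars
import io
import json
import logging
from datetime import datetime, timezone

# Holds the correlation ID for the request currently being handled.
request_id_var: contextvars.ContextVar[str] = contextvars.ContextVar(
    "request_id", default="-"
)


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname.lower(),
            "event": record.getMessage(),
            "request_id": request_id_var.get(),
            # Structured fields passed via extra={"fields": {...}}.
            **getattr(record, "fields", {}),
        }
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("document-api")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def demo() -> dict:
    """Log one event into a buffer and return it parsed back from JSON."""
    buffer = io.StringIO()
    log = build_logger(buffer)
    request_id_var.set("req_7e9d3b")
    log.info(
        "document.processing.started",
        extra={"fields": {"document_id": "doc_8f3a2c", "file_size_bytes": 204800}},
    )
    return json.loads(buffer.getvalue())


if __name__ == "__main__":
    print(demo())
```

Because the request ID lives in a context variable, every log call made while handling that request picks it up without the caller passing it along.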

A Metric Is Not a Log

A common mistake is trying to use logs as metrics:

# Wrong: trying to use a log query as a metric
log.info("cache_miss", key=cache_key)
# Then querying: count(event="cache_miss") per minute in Kibana

This works at small scale. At production scale:

Log ingestion has latency - your "metric" can lag by 30–60 seconds
Log storage is expensive - you are paying per GB for numerical data
Log queries are slow - a count over a time range typically has to scan every matching log line
Log fields have unbounded cardinality - grouping a log-derived metric by a field such as a UUID produces one series per distinct value

The right solution: a Prometheus counter.

from prometheus_client import Counter

cache_misses = Counter(
    "cache_misses_total",
    "Total cache misses",
    ["cache_name", "operation"],
)

# In your cache layer:
cache_misses.labels(cache_name="document_cache", operation="get").inc()

Now rate(cache_misses_total[5m]) in PromQL gives you the per-second cache miss rate averaged over the last five minutes, with no log parsing, a delay bounded by the scrape interval, and negligible storage.
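
What a rate over a counter means can be sketched in plain Python. This is a simplified model under stated assumptions: `simple_rate` is a hypothetical helper, it treats any drop in value as a counter reset, and it leaves out the extrapolation to window boundaries that Prometheus applies.

```python
def simple_rate(samples: list[tuple[float, float]]) -> float:
    """Per-second increase of a counter over the sampled window.

    `samples` holds (unix_timestamp, counter_value) pairs in time order.
    A drop in value is treated as a counter reset (for example a process
    restart), so counting resumes from zero at that sample.
    """
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += (curr - prev) if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Scrapes every 15 s; the process restarts between the 3rd and 4th scrape.
scrapes = [(0, 100), (15, 130), (30, 160), (45, 10), (60, 40)]

if __name__ == "__main__":
    # 30 + 30 + 10 (after reset) + 30 = 100 misses over 60 seconds
    print(round(simple_rate(scrapes), 3))
```

This is why counters only ever go up: the rate is derived at query time, and a restart costs at most the increments that happened between the last scrape and the reset.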

A Trace Is Not a Metric

Another common mistake:

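from prometheus_client import Histogram

# Hypothetical metric definition so the call below is complete
request_latency = Histogram("request_latency_seconds", "Request latency", ["service"])
# Wrong: expecting a histogram to explain why a service is slow
request_latency.labels(service="downstream-api").observe(latency)  # latency: seconds measured by the caller
# This tells you the downstream API is slow
# But it does NOT show you why - is it the network? The DB? A specific query?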

A Prometheus histogram tells you that the downstream API is slow. A distributed trace tells you why - it shows you every operation inside that service with its individual timing, the exact SQL queries that ran, the Redis lookups that happened, and the outbound HTTP calls that were made.

Use metrics for alerting. Use traces for root cause analysis.

The Observability Stack Used in This Module

All tools in this module are open source and run locally with docker-compose:

Tool	Role	Port
`structlog`	Structured logging library	(library)
Loki	Log aggregation and storage	3100
Promtail	Log shipper (files → Loki)	9080
Prometheus	Metrics scraping and storage	9090
Alertmanager	Alert routing and deduplication	9093
Grafana	Metrics and log dashboards	3000
OpenTelemetry Collector	Trace collection and routing	4317/4318
Jaeger	Distributed trace storage and UI	16686
Sentry (self-hosted)	Error tracking	9000

The full docker-compose setup is provided in Lesson 01.

Module Lessons

Lesson 01 - Structured Logging

The Python logging module internals, structlog pipeline configuration, correlation IDs via contextvars, JSON formatting, sensitive data masking, log aggregation with Loki, and async non-blocking log handlers. Transforms an unstructured service into one whose logs are instantly searchable.

Key deliverable: A logging_config.py module that any FastAPI service can drop in and immediately produce structured, correlated, JSON logs shipped to Loki.

Lesson 02 - Metrics with Prometheus

The Prometheus data model, all four metric types with real use cases, FastAPI auto-instrumentation, custom application metrics, PromQL for SRE work, Alertmanager rules, and a complete Grafana dashboard JSON.

Key deliverable: A metrics.py module with application-level metrics for a document processing service, 10 real PromQL queries, and 5 production alerting rules.

Lesson 03 - Distributed Tracing

OpenTelemetry Python SDK, auto-instrumentation for FastAPI / SQLAlchemy / Redis / HTTPX, custom spans for business logic, W3C trace context propagation, baggage, sampling strategies, and reading Jaeger waterfall diagrams.

Key deliverable: Full OpenTelemetry setup for a multi-service Python application with context propagation through HTTP, and trace IDs injected into log lines.

Lesson 04 - Error Tracking

Sentry Python SDK, enriching errors with user context and breadcrumbs, custom fingerprinting for error grouping, before_send hooks for sensitive data filtering, release tracking, and building an error triage workflow.

Key deliverable: A production Sentry configuration that groups errors intelligently, masks PII, and integrates with your release pipeline.

Lesson 05 - Health Checks and Readiness

Kubernetes liveness vs readiness vs startup probes, designing health checks that accurately reflect service health, parallel dependency checks with timeouts, SLOs and error budgets, synthetic monitoring, and health check anti-patterns.

Key deliverable: A complete /liveness, /readiness, and /startup implementation for a FastAPI service with PostgreSQL, Redis, and external API dependencies.
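
The idea of parallel dependency checks with timeouts can be sketched with asyncio alone. This is a minimal illustration, not the lesson's implementation: `run_check`, `readiness`, and the three `check_*` coroutines are hypothetical stand-ins for real PostgreSQL, Redis, and external API probes.

```python
import asyncio
from typing import Awaitable, Callable

Check = Callable[[], Awaitable[None]]


async def run_check(name: str, check: Check, timeout_s: float) -> tuple[str, str]:
    """Run one dependency check, bounding how long it may take."""
    try:
        await asyncio.wait_for(check(), timeout=timeout_s)
        return name, "ok"
    except asyncio.TimeoutError:
        return name, "timeout"
    except Exception as exc:
        return name, f"error: {exc}"


async def readiness(
    checks: dict[str, Check], timeout_s: float = 0.5
) -> tuple[int, dict[str, str]]:
    """Run all checks concurrently; 200 only if every dependency is ok."""
    pairs = await asyncio.gather(
        *(run_check(name, check, timeout_s) for name, check in checks.items())
    )
    results = dict(pairs)
    status = 200 if all(v == "ok" for v in results.values()) else 503
    return status, results


# Hypothetical stand-ins for real dependency probes.
async def check_postgres() -> None:
    await asyncio.sleep(0.01)             # healthy and fast


async def check_redis() -> None:
    raise ConnectionError("connection refused")


async def check_external_api() -> None:
    await asyncio.sleep(5)                # hangs past the timeout


if __name__ == "__main__":
    checks = {
        "postgres": check_postgres,
        "redis": check_redis,
        "external_api": check_external_api,
    }
    print(asyncio.run(readiness(checks, timeout_s=0.1)))
```

Running the checks concurrently keeps the probe's total latency close to the timeout rather than the sum of all checks, which matters because the kubelet applies its own timeout to the probe request.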

Observability Maturity Model

Before starting, assess where your service sits today:

Level	Name	Characteristics
0	Dark	`print()` statements, no structure, errors discovered by users
1	Basic Logs	`logging.basicConfig()`, some log lines, unstructured text
2	Structured Logs	JSON logs with levels, timestamps, and some context fields
3	Correlated Logs	Request IDs in every log line, logs shipped to aggregator
4	Metrics	Prometheus counters/histograms, dashboards, basic alerts
5	Error Tracking	Sentry with user context, release tracking, error workflows
6	Tracing	Distributed traces, p99 from traces, traces linked to logs
7	Full Observability	SLOs, error budgets, synthetic monitoring, runbooks linked to alerts

Most production Python services in the wild sit at Level 1 or 2. This module takes you to Level 7.

How to Work Through This Module

Each lesson follows the same structure:

Opening incident - a real production failure caused by missing observability
Concepts - the theory, explained through the lens of what the incident needed
Working code - production-grade implementations, not toy examples
Integration - how this pillar connects to the others
Interview Q&A - five questions asked at senior/staff engineering interviews

Run each lesson's code examples locally. By the end of Lesson 03, you will have a fully instrumented Python service with logs, metrics, and traces all running in docker-compose, all visible in Grafana.

Quick Reference: The Golden Signals

Before diving into implementation, here are the four signals every production service must measure (from Google's SRE Book):

Signal	What It Measures	Prometheus Metric Type
Latency	Time to serve a request (success vs error latency separately)	Histogram
Traffic	How much demand is hitting the system	Counter
Errors	Rate of failed requests (5xx, explicit failures, wrong results)	Counter
Saturation	How "full" the service is (CPU, memory, connection pools, queue depth)	Gauge

These four signals, exposed correctly, will catch most production incidents before users notice them. Lessons 02 through 05 show you how to implement each one properly.
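
To show how the four signals relate to raw request data, the sketch below computes each one from a small in-memory sample. The names `Request`, `percentile`, `golden_signals`, and `sample_window` are hypothetical and the data is invented for illustration. The percentile uses the nearest-rank method on raw values, whereas a Prometheus histogram estimates quantiles from bucket counts.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    duration_ms: float
    status: int


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def golden_signals(
    requests: list[Request], window_s: float, pool_in_use: int, pool_size: int
) -> dict[str, float]:
    # Latency is computed over successful requests only, so that fast
    # failures do not mask slow successes.
    ok_durations = [r.duration_ms for r in requests if r.status < 500]
    errors = [r for r in requests if r.status >= 500]
    return {
        "latency_p99_ms": percentile(ok_durations, 99),
        "traffic_rps": len(requests) / window_s,
        "error_ratio": len(errors) / len(requests),
        "saturation": pool_in_use / pool_size,
    }


# Invented 10-second window: mostly fast, one very slow success, two 500s.
sample_window = (
    [Request(400, 200)] * 97 + [Request(11_000, 200)] + [Request(50, 500)] * 2
)

if __name__ == "__main__":
    print(golden_signals(sample_window, window_s=10, pool_in_use=9, pool_size=10))
```

In this sample the error ratio and traffic look unremarkable, while p99 latency and saturation both stand out, which mirrors the opening incident.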

Let's build observable systems.
